where x is a real-valued input (an activation or a weight), S is a real-valued scaling factor, and Z is an integer zero point. The INT function converts a real number to an integer value via a rounding technique (e.g., round-to-nearest or truncation); it is simply a mapping from real values x to integer values. This method of quantization is also known as uniform quantization.
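As a concrete illustration, the following is a minimal NumPy sketch of this uniform quantizer; the function names, the unsigned 8-bit range, and the round-to-nearest choice are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def uniform_quantize(x, S, Z, bits=8):
    """Uniform quantization: q = INT(x / S) + Z, clipped to the integer grid."""
    q = np.round(x / S) + Z                      # INT(.) realized as round-to-nearest
    return np.clip(q, 0, 2**bits - 1).astype(np.int32)

def dequantize(q, S, Z):
    """Approximate reconstruction of the real value from its integer code."""
    return S * (q.astype(np.float32) - Z)

x = np.array([-0.71, 0.02, 0.64, 1.30], dtype=np.float32)
q = uniform_quantize(x, S=0.01, Z=128)
print(q, dequantize(q, S=0.01, Z=128))           # 1.30 clips to the top code
```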
In contrast, non-uniform quantization methods produce quantized values that are not necessarily uniformly spaced. Non-uniform quantization is formally defined as
q_x =
\begin{cases}
q_1, & \text{if } x \le \Delta_1, \\
\;\vdots \\
q_i, & \text{if } \Delta_{i-1} < x \le \Delta_i, \\
\;\vdots \\
q_U, & \text{if } x > \Delta_U,
\end{cases}
\qquad (2.4)
where q_i represents the discrete quantization levels and Δ_i denotes the quantization steps. When the value of a real number x falls between the quantization steps Δ_{i−1} and Δ_i, the quantizer Q projects it to the associated quantization level q_i. It should be noted that neither the q_i nor the Δ_i need be evenly spaced.
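The following NumPy sketch makes Eq. 2.4 concrete as a piecewise lookup; the specific steps and levels are made-up values, and the code uses U levels with U − 1 thresholds, with inputs beyond the outermost thresholds mapping to the extreme levels, matching the structure above.

```python
import numpy as np

def nonuniform_quantize(x, steps, levels):
    """Piecewise quantizer per Eq. 2.4: x in (Δ_{i-1}, Δ_i] maps to level q_i."""
    idx = np.digitize(x, steps, right=True)   # index of the interval containing x
    return levels[idx]

steps  = np.array([-1.0, -0.25, 0.25, 1.0])         # Δ_i (unevenly spaced)
levels = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])      # q_i (unevenly spaced)
x = np.array([-1.7, -0.3, 0.1, 0.6, 3.0])
print(nonuniform_quantize(x, steps, levels))        # [-2.  -0.5  0.   0.5  2. ]
```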
Non-uniform quantization can achieve higher accuracy for a fixed bit-width because it captures the underlying distribution more faithfully, concentrating quantization levels in important value regions or adapting to the appropriate dynamic range. For example, various non-uniform quantization techniques have been developed for the bell-shaped distributions of weights and activations, which often exhibit long tails. A commonly employed rule-based non-uniform quantization method uses a logarithmic distribution, where the quantization steps and levels grow exponentially rather than linearly.
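As a sketch of such a rule-based logarithmic scheme, the snippet below snaps each magnitude to the nearest power of two; the bit budget, the exponent clipping, and the small epsilon are illustrative assumptions.

```python
import numpy as np

def log2_quantize(x, bits=4):
    """Power-of-two quantization: magnitudes snap to the nearest 2^e."""
    sign = np.sign(x)
    e = np.round(np.log2(np.abs(x) + 1e-12))              # nearest exponent
    e = np.clip(e, -(2**(bits - 1)), 2**(bits - 1) - 1)   # bound exponent range
    return sign * 2.0**e

x = np.array([0.03, 0.2, 0.9, 5.7, -1.4])
print(log2_quantize(x))   # levels are dense near zero, sparse in the tails
```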
Recent advances treat non-uniform quantizer design as an optimization problem to enhance quantization performance. The goal is to minimize the difference between the original tensor and its quantized counterpart by adjusting the quantization steps/levels of the quantizer Q:
\min_{Q} \; \lVert Q(x) - x \rVert_2^2 . \qquad (2.5)
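A classical way to approximately solve Eq. 2.5 is alternating minimization in the style of Lloyd-Max, sketched below in NumPy; the quantile initialization, level count, and iteration budget are arbitrary illustrative choices.

```python
import numpy as np

def fit_levels(x, num_levels=8, iters=20):
    """Alternating minimization of ||Q(x) - x||^2 (Lloyd-Max-style sketch).

    Assignment step: snap each value to its nearest level.
    Update step: move each level to the mean of the values assigned to it.
    """
    levels = np.quantile(x, np.linspace(0, 1, num_levels))  # data-aware init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for i in range(num_levels):
            if np.any(idx == i):
                levels[i] = x[idx == i].mean()
    return np.sort(levels)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)        # bell-shaped data, as discussed above
levels = fit_levels(x)                 # levels end up denser near zero
```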
Non-uniform quantization can also be improved by making the quantizer itself trainable. Such methods are called learnable quantizers: the quantization steps/levels are optimized jointly with the model parameters, either through an iterative process or by gradient descent.
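A minimal PyTorch sketch of such a learnable quantizer follows, in the spirit of learned step size quantization; a straight-through estimator passes gradients through the rounding so the step size can be trained by gradient descent. The class name, bit-width, and training loop are illustrative, and refinements from the literature (e.g., LSQ's gradient scaling) are omitted.

```python
import torch

class LearnableStepQuantizer(torch.nn.Module):
    """Uniform quantizer with a learnable step size (LSQ-style sketch)."""
    def __init__(self, bits=4, init_step=0.1):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1              # e.g., 7 for 4-bit signed
        self.step = torch.nn.Parameter(torch.tensor(init_step))

    def forward(self, x):
        s = self.step.abs() + 1e-8                   # keep the step positive
        xq = torch.clamp(x / s, -self.qmax - 1, self.qmax)
        xq = xq + (torch.round(xq) - xq).detach()    # straight-through rounding
        return xq * s                                # gradients reach `step`

quantizer = LearnableStepQuantizer(bits=4)
x = torch.randn(1024)
opt = torch.optim.SGD(quantizer.parameters(), lr=1e-2)
for _ in range(200):                                 # minimize Eq. 2.5 over the step
    opt.zero_grad()
    loss = torch.mean((quantizer(x) - x) ** 2)
    loss.backward()
    opt.step()
```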
Overall, non-uniform quantization can represent data more faithfully by distributing bits unevenly and discretizing the range of parameters non-uniformly. However, this type of quantization can be challenging to implement efficiently on standard computation hardware such as GPUs and CPUs. As a result, uniform quantization remains the prevalent method because of its straightforward implementation and efficient mapping to hardware.
2.1.2 Symmetric and Asymmetric Quantization
The choice of the scaling factor S in the uniform quantization function above is crucial. S divides the range of real values x into a specified number of segments and thus determines the size of each partition. Its value sets the granularity of the quantization and ultimately affects the accuracy of the quantized representation:
S = \frac{\beta - \alpha}{2^b - 1}, \qquad (2.6)
where [α, β] is the clip range and b is the bit-width.
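A small Python sketch of this computation follows; Eq. 2.6 gives the scale S, while the zero-point formula shown is one common convention assumed here for illustration rather than taken from the text.

```python
def scale_and_zero_point(alpha, beta, bits=8):
    """Scale S from Eq. 2.6; the zero-point formula is a common convention."""
    S = (beta - alpha) / (2**bits - 1)
    Z = round(-alpha / S)              # integer code that represents real 0.0
    return S, Z

S, Z = scale_and_zero_point(alpha=-0.5, beta=1.5)
print(S, Z)                            # S = 2/255 ≈ 0.00784, Z = 64
```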
The clipping range [α, β] determines the range of real values that should be quantized. Choosing this range well is important, as it determines the precision of the quantization and the overall quality of the quantized model. This